Kerri Anderson

Project 5

Credit Card User Churn

September 2021

AIML

The purpose of this project is to identify customers who are likely to cancel their credit card accounts. Because credit cards are a strong source of income for banks, the bank wants to use past trends to identify customers who are likely to close their accounts, so it can take targeted actions to prevent current and future customers from leaving.

We will use historical data to identify patterns in customer profiles that align with account cancellations.

Several modeling techniques will be built and compared within this project, with model tuning used to optimize performance. The goal is to select a model that meets the following criteria: "Recall on the test set is expected to be > 0.95, and precision and accuracy is expected to be > 0.70"

Another goal of the project is to transform the datasets and divide them into training, testing, and validation datasets without introducing data leakage.

Import Libraries

Import Data

Some object variables will need to be transformed into category variables or into numeric dummy or indicator variables in order to be usable by the models. We also need to re-encode income numerically, as it is an ordered categorical variable. Any transformations of the numeric variables must be done in a way that avoids data leakage.
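As a minimal sketch of the two encodings described above (column names such as `Income_Category` and `Gender` are assumed from the typical credit card churn dataset, not taken from the notebook's code):

```python
import pandas as pd

df = pd.DataFrame({
    "Income_Category": ["Less than $40K", "$40K - $60K", "$60K - $80K",
                        "$80K - $120K", "$120K +", "Less than $40K"],
    "Gender": ["M", "F", "F", "M", "M", "F"],
})

# Income is an ordered category: map each band to its rank so the order survives.
income_order = ["Less than $40K", "$40K - $60K", "$60K - $80K",
                "$80K - $120K", "$120K +"]
df["Income_Category"] = pd.Categorical(
    df["Income_Category"], categories=income_order, ordered=True
).codes  # 0 = lowest band, 4 = highest

# Unordered object columns become 0/1 indicator (dummy) variables.
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)
```

Using ordinal codes for income keeps a single column that models can treat monotonically, whereas dummies would discard the ordering.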

We will need to account for missing data in education level and marital status.

Data Pre-processing

We created dummy variables for the attrition flag and kept the "Attrited Customer" indicator, which equals 1 if the customer has attrited. As the model looks to identify customers likely to attrit, it is best to keep this variable and drop the redundant "Existing Customer" indicator.
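A sketch of this target encoding (the `Attrition_Flag` column name and its two levels are assumptions based on the standard dataset):

```python
import pandas as pd

df = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer",
                                      "Existing Customer"]})

# Keep a single 0/1 target: 1 = customer has attrited, 0 = still a customer.
df["Attrition_Flag"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
```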

EDA

Univariate Analysis of Categorical Variables

Univariate Analysis of Continuous Variables

Bivariate Analysis

Notes on data preparation as a result of EDA:

Feature engineering

Given the patterns seen for post-graduate and doctorate degrees versus lower levels of education attained, I chose to group education into a binary variable for simplification.
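One way this grouping could be sketched (the `Education_Level` values and the new `Advanced_Degree` column name are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Education_Level": ["High School", "Graduate", "Doctorate",
                                       "Post-Graduate", "Uneducated"]})

# Collapse education into one binary feature:
# 1 = advanced degree (Post-Graduate or Doctorate), 0 = all other levels.
advanced = {"Post-Graduate", "Doctorate"}
df["Advanced_Degree"] = df["Education_Level"].isin(advanced).astype(int)
```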

Split data

Split data into test, train, and validation datasets
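A common pattern for a three-way split is two successive calls to `train_test_split`; the 70/15/15 proportions and toy data here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.arange(100).reshape(-1, 1)       # toy feature matrix
y = rng.integers(0, 2, 100)             # toy binary target

# First carve off 30% as a temporary holdout, then split that holdout
# in half into validation and test sets. Stratify to keep class balance.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=1, stratify=y_temp)
```

The validation set is used to compare tuned models; the test set is touched only once, at the end.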

Avoid data leakage when modifying data to account for missing values

After imputing missing values, we need to verify the code ran successfully and as intended

Verify there are no more missing values in any of the datasets
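The leakage-safe pattern is to fit the imputer on the training split only and then apply it to the other splits; a minimal sketch with a hypothetical `Education_Level` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Education_Level": ["Graduate", np.nan,
                                          "High School", "Graduate"]})
test = pd.DataFrame({"Education_Level": [np.nan, "Doctorate"]})

# Fit on the training split only; the test/validation data never
# influence the imputed value, so no leakage occurs.
imputer = SimpleImputer(strategy="most_frequent")
train["Education_Level"] = imputer.fit_transform(train[["Education_Level"]]).ravel()
test["Education_Level"] = imputer.transform(test[["Education_Level"]]).ravel()

# Verify the imputation succeeded: no missing values remain in any split.
assert train.isnull().sum().sum() == 0
assert test.isnull().sum().sum() == 0
```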

Create dummy variables and drop the most frequent level to aid in model stability
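`pd.get_dummies(drop_first=True)` drops the alphabetically first level, so dropping the *most frequent* level takes one extra step; a sketch with a hypothetical `Marital_Status` column:

```python
import pandas as pd

df = pd.DataFrame({"Marital_Status": ["Married", "Single", "Married",
                                      "Divorced", "Married"]})

col = "Marital_Status"
most_frequent = df[col].mode()[0]                  # the modal level ("Married")
dummies = pd.get_dummies(df[col], prefix=col)
dummies = dummies.drop(columns=f"{col}_{most_frequent}")  # drop the modal level
df = pd.concat([df.drop(columns=col), dummies], axis=1)
```

Dropping the most frequent level makes it the reference category, which tends to keep coefficient estimates stable.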

Model evaluation criterion

Define TP (predicted to attrit and did attrit), TN (predicted to stay and did stay), FP (predicted to attrit but stayed), and FN (predicted to stay but attrited)

The company could face two types of losses:

  1. Taking action to prevent attrition when there is no risk - loss of the time, money, and resources invested in retention strategies that are not warranted
  2. Not taking action in the form of retention strategies for customers truly at risk of attrition - loss of future income when they close their credit card accounts

Which loss is greater?

How do we reduce this loss, i.e., reduce false negatives?

Oversampling Data with SMOTE

Create oversampling dataset for model building and performance evaluation
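The project presumably uses imbalanced-learn's `SMOTE`; the core idea - generating synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours - can be sketched with only NumPy and scikit-learn (function name and toy data are hypothetical):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]   # pick one of its true neighbours
        gap = rng.random()                   # interpolation fraction in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Hypothetical minority class of 20 points in 2-D; add 30 synthetic points.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_minority(X_min, n_new=30)
```

In practice `imblearn.over_sampling.SMOTE().fit_resample(X_train, y_train)` does this (and handles both classes) in one call; note that resampling must be applied only to the training split.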

Undersampling data

Create undersampling dataset for model building and performance evaluation
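Random undersampling simply downsamples the majority class to the minority-class size; a sketch on toy data (column names are assumptions):

```python
import pandas as pd

# Toy imbalanced training data: 90 retained customers, 10 attrited.
df = pd.DataFrame({"Attrition_Flag": [0] * 90 + [1] * 10, "x": range(100)})

# Randomly downsample the majority class to the minority-class size,
# then shuffle the combined result.
minority = df[df["Attrition_Flag"] == 1]
majority = df[df["Attrition_Flag"] == 0].sample(n=len(minority), random_state=1)
df_under = pd.concat([majority, minority]).sample(frac=1, random_state=1)
```

This trades information loss (discarded majority rows) for a balanced class ratio; like SMOTE, it is applied to the training split only.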

The following 6 models will be built using the original dataset, an oversampled dataset, and an undersampled dataset.

Model performance will be evaluated primarily on Recall, but should also consider a balance of performance with regard to accuracy, precision, and F1.

Each model will be built, performance measures will be printed, and the final comments will be available at the end of this block of code to compare all models side by side.
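A small helper like the following (hypothetical, not the notebook's actual function) makes the side-by-side comparison easy by collecting the four metrics per model into one row:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score_model(name, y_true, y_pred):
    """Collect the four comparison metrics for one fitted model."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Toy predictions: 2 TP, 1 FN, 0 FP, 2 TN.
row = score_model("example", [1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
```

Appending each model's row to a list and building a DataFrame at the end gives the side-by-side summary table.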

Models to evaluate:

Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
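A minimal sketch of that evaluation on synthetic data (the fold count and toy dataset are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# 5-fold cross-validated recall for a baseline logistic regression.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=kfold, scoring="recall")
mean_recall = scores.mean()
```

Scoring on `"recall"` rather than the default accuracy matches the project's primary metric.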

Logistic Regression on oversampled data

Logistic Regression on undersampled data

The logistic regression model performed fairly well on the training dataset with the oversampled and undersampled data.

Overall, though, logistic regression performance doesn't look strong: precision and F1 are fairly low even when recall and accuracy are high, so we should try more models.

Decision Tree

Decision tree - original data

Decision Tree - Oversampled data

Decision Tree - Undersampled data

Random Forest

Random Forest - original data

Random Forest - Oversampled Data

Bagging Classifier

Bagging Classifier Original Dataset

Adaboost classifier

Adaboost original data

Adaboost oversampling

Adaboost undersampling

Gradient Boosting classifier

Gradient Boosting original data

Gradient Boosting oversampled

Gradient Boosting Undersampling

Model Performance Summary and Comparison

Observations from modeling:

As a next step in the process, the three best-performing models will be tuned using randomized cross-validated search (RandomizedSearchCV) to identify the hyperparameters that give optimal model performance.

Model performance will be compared and comments will be made at the end of this block of code.

Random Forest Tuning

Tune random forest model on undersampled dataset

Adaboost Tuning

Random CV Search for Adaboost on Undersampled Data
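The shape of that search, sketched on synthetic data (the parameter grid, `n_iter`, and fold count are illustrative assumptions, not the notebook's actual settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Hypothetical search space; sample a few combinations, scored on recall.
param_dist = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
}
search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1), param_dist,
    n_iter=5, cv=3, scoring="recall", random_state=1)
search.fit(X, y)
best = search.best_params_
```

Unlike a full grid search, `RandomizedSearchCV` samples only `n_iter` parameter combinations, which keeps tuning affordable over larger spaces.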

Gradient Boosting Tuning

Tuned model performance comparison

Compare model performance of the three tuned models on the validation dataset

The performance was strongest on the GBM tuned on the undersampled dataset. There is a healthy balance of strong recall, as well as high accuracy and precision.

Evaluate Model Performance on Test Dataset

The tuned GBM model also performed well on the test dataset. I would recommend moving this model forward to help identify customers at risk of attrition. Based on the variables deemed important, the company can develop plans to implement customer-specific strategies to reduce attrition.

Evaluate variable importance
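For tree-based ensembles like the chosen GBM, the fitted model exposes impurity-based importances; a sketch on synthetic data with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
features = [f"feature_{i}" for i in range(5)]

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Rank features by the fitted model's impurity-based importances.
importance = (pd.Series(model.feature_importances_, index=features)
                .sort_values(ascending=False))
```

The importances sum to 1, so each value reads directly as a share of the model's total split gain.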

Final Recommendation

Pipeline